pandas data cleaning for web scraping is one of the fastest ways to turn messy crawler outputs into analysis-ready datasets. In this chapter, you will learn how to extract tables from web pages, convert text fields into numeric values, handle missing and duplicate rows, and merge multiple datasets efficiently.
Before we start, note that each tutorial in this series is independent, so you can follow this chapter on its own. If you want the full learning path, you can review the earlier chapters in this series.
Outbound references (authoritative docs):
read_html docs: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
pandas docs: https://pandas.pydata.org/docs/
9. pandas data processing
pandas is one of the most practical libraries in the Python ecosystem for cleaning, transforming, and merging crawler data. If you master it, you will process data faster and more consistently than you would with spreadsheets.
Install pandas
pip install pandas
9.1 Extract HTML tables with read_html
In many scraping scenarios, the page already contains structured tables. Instead of manually parsing HTML, you can let pandas extract tables directly.
For example, you can read tables from Wikipedia (note that read_html needs an HTML parser such as lxml installed):
import pandas as pd
tables = pd.read_html("https://en.wikipedia.org/wiki/Minnesota")
print(f"Total tables: {len(tables)}")
If you only want a specific table, you can match by table title text:
tables = pd.read_html(
"https://en.wikipedia.org/wiki/Minnesota",
match="Election results from statewide races"
)
df = tables[0]
df.head()
Next, check column types:
df.info()
9.2 Convert percentage strings to numeric values
Often, scraped numbers come with symbols such as %. As a result, direct conversion will fail:
df["GOP"].astype("float")
You can remove % first and then convert:
df["GOP"].replace({"%": ""}, regex=True).astype("float")
If you want to convert multiple columns at once, remove % globally and then apply numeric conversion:
df = df.replace({"%": ""}, regex=True)
df[["GOP", "DFL", "Others"]] = df[["GOP", "DFL", "Others"]].apply(pd.to_numeric)
df.info()
This step is typical in pandas data cleaning for web scraping, because websites often format numbers for humans rather than for analysis.
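If some scraped cells still contain stray text such as footnote markers or placeholders, pd.to_numeric with errors="coerce" turns them into NaN instead of raising an error. A minimal sketch with made-up values:
import pandas as pd

# Hypothetical scraped values: a footnote marker and a placeholder survive the "%" removal
s = pd.Series(["44.9%", "52.4%[a]", "n/a"])
cleaned = pd.to_numeric(s.str.replace("%", "", regex=False), errors="coerce")
print(cleaned)  # 44.9, NaN, NaN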
9.3 Handle missing values (NaN)
Missing values can distort metrics and aggregations. Therefore, you should detect them early.
Detect missing values with isnull():
import pandas as pd
df = pd.DataFrame({
"A": [1, 2, None, 4],
"B": [None, 2, 3, 4],
"C": [1, None, None, 4]
})
print(df.isnull())
Count missing values per column with isnull().sum():
print(df.isnull().sum())
Remove missing rows or columns with dropna():
df_cleaned_rows = df.dropna() # drop rows with any NaN
df_cleaned_cols = df.dropna(axis=1) # drop columns with any NaN
In practice, you often combine dropping with filling. For example, you may fill categorical fields with “unknown” and keep numeric fields for analysis.
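Here is a minimal sketch of that pattern, using a made-up DataFrame with a text field "category" and a numeric field "price":
import pandas as pd

# Hypothetical scraped records with gaps in both a text field and a numeric field
df = pd.DataFrame({
    "category": ["books", None, "toys"],
    "price": [12.5, 7.0, None]
})

df["category"] = df["category"].fillna("unknown")   # fill the categorical field
df = df.dropna(subset=["price"])                    # drop rows missing the numeric field
print(df)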
9.4 Handle duplicate values
Scraped data often contains duplicates because pages repeat items or APIs return overlapping results. So you should detect and remove duplicates.
Detect duplicates with duplicated():
import pandas as pd
df = pd.DataFrame({
"A": [1, 2, 2, 4],
"B": [5, 6, 6, 8]
})
print(df.duplicated())
Drop duplicates with drop_duplicates():
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
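In scraping, duplicates are often defined by a key such as a URL or product ID rather than by the whole row. A sketch of that case, assuming a hypothetical "url" column:
import pandas as pd

# Hypothetical crawl results where the same URL was fetched twice with different prices
df = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/a", "https://example.com/b"],
    "price": [10, 12, 8]
})

# Keep one row per URL; keep="last" retains the most recently scraped value
df_unique = df.drop_duplicates(subset="url", keep="last")
print(df_unique)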
9.5 Common string and type conversions
Crawler outputs frequently store numbers and time as strings. Therefore, you should normalize them before analysis.
Useful pandas operations:
- pd.to_datetime() to parse time fields
- str.strip() to remove leading/trailing whitespace
- str.lower() to normalize case
- str.replace() to standardize values
Example:
import pandas as pd
df = pd.DataFrame({
"A": ["1", "2", "3", "4"],
"B": ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"],
"C": [" hello ", " world ", " old_value ", " new_value "]
})
df["A"] = df["A"].astype(float)
df["B"] = pd.to_datetime(df["B"], errors="coerce")
df["C"] = df["C"].str.strip()
df["C"] = df["C"].str.lower()
df["C"] = df["C"].replace({"old_value": "new_value"})
print(df)
9.6 Regular expressions on columns
Regex is common in scraping, especially for extracting and standardizing text.
Extract with regex
df["phone_area"] = df["phone"].str.extract(r"\((\d{3})\)")
Vectorized string concatenation
df["name"] = df["first_name"].str.cat(df["last_name"], sep=" ")
Fuzzy matching (optional)
If you need approximate matching, you can use fuzzywuzzy (its maintained successor is published as thefuzz):
pip install fuzzywuzzy
from fuzzywuzzy import fuzz
df["similarity"] = df.apply(lambda x: fuzz.ratio(x["name1"], x["name2"]), axis=1)
9.7 High-performance data merging
After pandas data cleaning for web scraping, you often merge datasets, for example:
- user profiles + order lists
- product tables + inventory tables
- company A data + company B data
Fast merging based on index with join():
result = df1.join(df2, how="left")
Example:
import pandas as pd
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=["K0", "K1"])
df2 = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])
print(df1.join(df2))
Merge by key with merge():
result = pd.merge(df1, df2, on="key", how="outer", indicator=True)
The indicator=True flag helps you track which rows come from which table. As a result, debugging merges becomes easier.
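The df1 and df2 above are aligned by index and have no shared "key" column, so here is a separate sketch with hypothetical product and inventory tables joined on a "sku" key:
import pandas as pd

# Hypothetical product and inventory tables that share a "sku" key
products = pd.DataFrame({"sku": ["P1", "P2", "P3"], "name": ["pen", "cup", "hat"]})
inventory = pd.DataFrame({"sku": ["P2", "P3", "P4"], "stock": [5, 0, 12]})

result = pd.merge(products, inventory, on="sku", how="outer", indicator=True)
print(result)  # the "_merge" column shows both / left_only / right_only per row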
Concatenate by axis with concat():
pd.concat([df1, df2], axis=1, keys=["2024", "2025"])
concat() works well when you stack rows or align columns by index. When stacking rows, join="inner" keeps only the columns shared by all inputs, as in the sketch below.
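For example, stacking two monthly scrapes with partially overlapping columns (the frames and column names are made up):
import pandas as pd

# Hypothetical monthly scrapes: February is missing the "rating" column
jan = pd.DataFrame({"url": ["a", "b"], "price": [10, 11], "rating": [4.5, 4.0]})
feb = pd.DataFrame({"url": ["c", "d"], "price": [12, 13]})

# join="inner" keeps only the columns present in every input frame
stacked = pd.concat([jan, feb], axis=0, join="inner", ignore_index=True)
print(stacked)  # columns: url, price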
9.8 Save cleaned data into a database
Once your dataset has no obvious missing values, duplicates, or formatting issues, you can store it.
For example, you can save a DataFrame into SQL using to_sql:
df.to_sql("table_name", con)
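to_sql accepts a SQLAlchemy engine or a plain sqlite3 connection. A minimal sketch using the standard-library sqlite3 module (the file and table names are placeholders):
import sqlite3
import pandas as pd

df = pd.DataFrame({"url": ["a", "b"], "price": [10.0, 12.5]})   # stand-in for your cleaned data

con = sqlite3.connect("scraped.db")                             # file-based SQLite database
df.to_sql("scraped_data", con, if_exists="replace", index=False)

print(pd.read_sql("SELECT * FROM scraped_data", con))           # verify the write
con.close()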
At this point, the stored data is cleaner, more consistent, and ready for downstream queries, dashboards, or APIs.
Summary
This chapter showed a practical workflow for pandas data cleaning for web scraping: extract tables, normalize numeric fields, handle missing/duplicate values, apply string cleanup, use regex, merge datasets, and finally store results into a database. As a result, your crawler output becomes analysis-ready data instead of raw noise.